Abstract - In a "tipping" model, each node in a social network, 
representing an individual, adopts a behavior if a certain number 
of their incoming neighbors previously adopted it. A key 
problem for viral marketers is to determine an initial "seed" 
set in a network such that, if the seed set is given a behavior, 
the entire network adopts it. Here we introduce a method for 
quickly finding seed sets that scales to very large networks. Our 
approach finds a set of nodes that guarantees spreading to the 
entire network under the tipping model. After experimentally 
evaluating 31 real-world networks, we found that our approach 
often finds seed sets that are several orders of magnitude smaller 
than the population size. Our approach also scales well - on a 
Friendster social network consisting of 5.6 million nodes and 
28 million edges we found a seed set in under 3.6 hours. We 
also find that highly clustered local neighborhoods and dense 
network-wide community structure together suppress the ability 
of a trend to spread under the tipping model. 

I. Introduction 

A much studied model in network science, tipping [20], 
[11], [10] (a.k.a. deterministic linear threshold [12]), is often 
associated with "seed" or "target" set selection [7] (a.k.a. 
the maximum influence problem). In this problem we have a 
social network in the form of a directed graph and thresholds 
for each individual. Based on this data, the desired output is 
the smallest possible set of individuals such that, if initially 
activated, the entire population will adopt the new behavior (a 
seed set). This problem is NP-complete [9], [12]. Although 
approximation algorithms have been proposed [3], [7], [8], 
[15], none seem to scale to very large data sets. Here, inspired 
by shell decomposition [2], [5], [13], we present a method 
guaranteed to find a set of nodes that causes the entire population 
to activate, although the set is not necessarily of minimal size. We then 
evaluate the algorithm on 31 large real-world social networks 
and show that it often finds very small seed sets (often several 
orders of magnitude smaller than the population size). We also 
show that the size of a seed set is related to Louvain modularity 
and average clustering coefficient. Therefore, we find that 
dense community structure and tight-knit local neighborhoods 
together inhibit the spreading of trends under the tipping 
model. 

The rest of the paper is organized as follows. In Section II 
we provide formal definitions of the tipping model. This 
is followed by the presentation of our new algorithm in 
Section III. We then describe our experimental results in 
Section IV. Finally, we provide an overview of related work 
in Section V. 

II. Technical Preliminaries 

Throughout this paper we assume the existence of a social 
network, $G = (V, E)$, where $V$ is a set of vertices and $E$ is a 
set of directed edges. We will use the notation $n$ and $m$ for the 
cardinality of $V$ and $E$ respectively. For a given node $v_i \in V$, 
the set of incoming neighbors is $\eta_i^{in}$, and the set of outgoing 
neighbors is $\eta_i^{out}$. The cardinalities of these sets (and hence 
the in- and out-degrees of node $v_i$) are $d_i^{in}$ and $d_i^{out}$ respectively. 
We now define a threshold function that for each node returns 
the fraction of incoming neighbors that must be activated for 
it to become active as well. 

Definition 1 (Threshold Function): We define the threshold 
function as a mapping from $V$ to $(0, 1]$. Formally: $\theta : V \to (0, 1]$. 

For the number of neighbors that must be active, we will 
use the shorthand $k_i$. Hence, for each $v_i$, $k_i = \lceil \theta(v_i) \cdot d_i^{in} \rceil$. 
We now define an activation function that, given an initial set 
of active nodes, returns a set of active nodes after one time 
step. 

Definition 2 (Activation Function): Given a threshold function, 
$\theta$, an activation function $A_\theta$ maps subsets of $V$ to 
subsets of $V$, where for some $V' \subseteq V$, 

$$A_\theta(V') = V' \cup \{v_i \in V \text{ s.t. } |\eta_i^{in} \cap V'| \geq k_i\} \qquad (1)$$



We now define multiple applications of the activation func- 
tion. 

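Definition 3 (Multiple Applications of the Activation Function): 
Given a natural number $i > 0$, set $V' \subseteq V$, and threshold 
function, $\theta$, we define the multiple applications of the 
activation function, $A_\theta^i(V')$, as follows: 

$$A_\theta^i(V') = \begin{cases} A_\theta(V') & \text{if } i = 1 \\ A_\theta(A_\theta^{i-1}(V')) & \text{otherwise} \end{cases} \qquad (2)$$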






Clearly, when $A_\theta^i(V') = A_\theta^{i-1}(V')$ the process has converged. 
Further, this occurs in no more than $n$ steps (as, in 
each step before convergence, at least one new node must be activated). Based 
on this idea, we define the function $\Gamma_\theta$, which returns the set 
of all nodes activated upon the convergence of the activation 
function. 



Definition 4 ($\Gamma$ Function): Let $j$ be the least value such that 
$A_\theta^j(V') = A_\theta^{j-1}(V')$. We define the function $\Gamma_\theta : 2^V \to 2^V$ 
as follows. 

$$\Gamma_\theta(V') = A_\theta^j(V') \qquad (3)$$
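
Definitions 1-4 can be made concrete with a few lines of Python. The following is a minimal sketch rather than the implementation used in our experiments: the function names, the dictionary-of-sets graph representation, and the four-node example network are assumptions introduced here for illustration.

```python
from math import ceil


def thresholds(in_nbrs, theta):
    """k_i = ceil(theta(v_i) * d_i^in) for every node v_i."""
    return {v: ceil(theta[v] * len(in_nbrs[v])) for v in in_nbrs}


def activate(in_nbrs, k, active):
    """One application of A_theta (Equation 1)."""
    newly = {v for v in in_nbrs if len(in_nbrs[v] & active) >= k[v]}
    return set(active) | newly


def gamma(in_nbrs, k, seed):
    """Apply A_theta repeatedly until nothing changes (Definition 4)."""
    active = set(seed)
    while True:
        nxt = activate(in_nbrs, k, active)
        if nxt == active:
            return active
        active = nxt


# Hypothetical symmetric network: triangle a-b-c with a pendant node d on c.
in_nbrs = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
k = thresholds(in_nbrs, {v: 0.5 for v in in_nbrs})
reached = gamma(in_nbrs, k, {"a"})
```

With a 50% threshold, seeding node a alone first tips b, then c (which needs two active neighbors), and finally d, so the single node a is a seed set for this toy network. Thresholds are computed in floating point here; exact rational arithmetic would be a safer choice for fractions that are not exactly representable.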

We now have all the pieces to introduce our problem - 
finding the minimal number of nodes that are initially active 
to ensure that the entire set V becomes active. 

Definition 5 (The MIN-SEED Problem): The MIN-SEED 
Problem is defined as follows: given a threshold function, $\theta$, 
return $V' \subseteq V$ s.t. $\Gamma_\theta(V') = V$, and there does not exist 
$V'' \subseteq V$ where $|V''| < |V'|$ and $\Gamma_\theta(V'') = V$. 

The following theorem is from the literature [9], [12] and 
tells us that the MIN-SEED problem is NP-complete. 

Theorem 1 (Complexity of MIN-SEED [9], [12]): MIN-SEED 
is NP-complete. 

III. Algorithm 

To deal with the intractability of the MIN-SEED problem, 
we design an algorithm that finds a non-trivial subset of nodes 
that causes the entire graph to activate, but we do not guarantee 
that the resulting set will be of minimal size. The algorithm 
is based on the idea of shell decomposition often cited in 
physics literature [2], [5], [13], [21] but modified to ensure 
that the resulting set will lead to all nodes being activated. 
The algorithm, TIP_DECOMP, is presented in this section. 



Algorithm 1 TIP_DECOMP 

Require: Threshold function, $\theta$, and directed social network $G = (V, E)$ 
Ensure: $V'$ 

For each vertex $v_i$, compute $k_i$. 
For each vertex $v_i$, $dist_i = d_i^{in} - k_i$. 
FLAG = TRUE. 
while FLAG do 
    Let $v_i$ be the element of $V$ where $dist_i$ is minimal. 
    if $dist_i = \infty$ then 
        FLAG = FALSE. 
    else 
        Remove $v_i$ from $G$ and for each $v_j$ in $\eta_i^{out}$, if $dist_j > 0$, 
        set $dist_j = dist_j - 1$. Otherwise set $dist_j = \infty$. 
    end if 
end while 
return All nodes left in $G$. 



Intuitively, the algorithm proceeds as follows (Figure 1). 
Given network $G = (V, E)$ where each node $v_i$ has threshold 
$k_i = \lceil \theta(v_i) \cdot d_i^{in} \rceil$, at each iteration, pick the node for which 
$d_i^{in} - k_i$ is the least but non-negative and remove it, where $d_i^{in}$ 
is measured in the current, pruned network. Once 
there are no nodes for which $d_i^{in} - k_i$ is non-negative, the 
algorithm outputs the remaining nodes in the network. 
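
The pruning procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used in our experiments: it swaps in the standard-library `heapq` binary heap with lazy deletion of stale entries in place of a binomial heap, and the function names, graph representation, and example network are hypothetical.

```python
import heapq
from math import ceil, inf


def tip_decomp(in_nbrs, out_nbrs, theta):
    """Prune nodes in order of dist_i = d_i^in - k_i; return the nodes left."""
    dist = {v: len(in_nbrs[v]) - ceil(theta[v] * len(in_nbrs[v])) for v in in_nbrs}
    heap = [(d, v) for v, d in dist.items()]
    heapq.heapify(heap)
    removed = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != dist[v]:
            continue  # stale heap entry: node already removed or its dist changed
        removed.add(v)
        for u in out_nbrs[v]:
            if u in removed or dist[u] == inf:
                continue
            if dist[u] > 0:
                dist[u] -= 1
                heapq.heappush(heap, (dist[u], u))
            else:
                dist[u] = inf  # u can no longer be pruned, so it stays as a seed
    return set(in_nbrs) - removed


def spreads_to_all(in_nbrs, theta, seed):
    """Run the tipping process from `seed` and report whether every node activates."""
    k = {v: ceil(theta[v] * len(in_nbrs[v])) for v in in_nbrs}
    active, changed = set(seed), True
    while changed:
        changed = False
        for v in in_nbrs:
            if v not in active and len(in_nbrs[v] & active) >= k[v]:
                active.add(v)
                changed = True
    return active == set(in_nbrs)


# Hypothetical symmetric network: triangle a-b-c with a pendant node d on c.
nbrs = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
theta = {v: 0.5 for v in nbrs}
seed = tip_decomp(nbrs, nbrs, theta)
```

On this toy network the sketch removes d, c, and a in that order and returns the single node b, and `spreads_to_all` confirms that b alone tips the whole network. Nodes whose distance becomes infinite are never pushed back onto the heap, so the loop ends when the heap is empty rather than when an infinite minimum is found.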

Now, we prove that the resulting set of nodes is guaranteed 
to cause all nodes in the graph to activate under the tipping 
model. This proof follows from the fact that any node removed 
is activated by the remaining nodes in the network. 




Fig. 1. Example of our algorithm for a simple network depicted in box A. 
We use a threshold value set to 50% of the node degree. Next to each node 
label (lower-case letter) is the value for $d_i^{in} - k_i$ (where $k_i = \lceil d_i^{in}/2 \rceil$). In the 
first four iterations, nodes e, f, h, and i are removed, resulting in the network 
in box B. This is followed by the removal of node j, resulting in the network 
in box C. In the next two iterations, nodes a and b are removed (boxes D-E 
respectively). Finally, node c is removed (box F). The nodes of the final 
network, consisting of d and g, have negative values for $d_i^{in} - k_i$ and become 
the output of the algorithm. 

Theorem 2: If all nodes in $V' \subseteq V$ returned by 
TIP_DECOMP are initially active, then every node in $V$ will 
eventually be activated, too. 

Proof: Let $w$ be the total number of nodes removed by 
TIP_DECOMP, where $v_1$ is the last node removed and $v_w$ is 
the first node removed. We prove the theorem by induction on 
$w$ as follows. We use $P(w)$ to denote the inductive hypothesis 
which states that all nodes from $v_1$ to $v_w$ are active. In the 
base case, $P(1)$ trivially holds as we are guaranteed that from 
set $V'$ there are at least $k_1$ edges to $v_1$ (or it would not be 
removed). For the inductive step, assuming $P(w)$ is true, when 
$v_{w+1}$ was removed from the graph $dist_{w+1} \geq 0$, which means 
that $d_{w+1}^{in} \geq k_{w+1}$. All nodes in $\eta_{w+1}^{in}$ at the time when $v_{w+1}$ 
was removed are now active, so $v_{w+1}$ will now be activated - 
which completes the proof. ■ 

We also note that by using the appropriate data structure (we 
used a binomial heap in our implementation), for a network 
of $n$ nodes and $m$ edges, this algorithm can run in time 
$O(m \log n)$. 

Proposition 1: The complexity of TIP_DECOMP is $O(m \cdot \log(n))$. 

IV. Results 

All experiments were run on a computer equipped with an 
Intel X5677 Xeon Processor operating at 3.46 GHz with a 
12 MB Cache. The machine was running Red Hat Enterprise 
Linux version 6.1 and equipped with 70 GB of physical 
memory. TIP_DECOMP was written using Python 2.6.6 in 
200 lines of code that leveraged the NetworkX library available 
from http://networkx.lanl.gov/. The code used a binomial 
heap library written by Bjorn B. Brandenburg available from 
http://www.cs.unc.edu/~bbb/. All statistics presented in this 
section were calculated using R 2.13.1. 

A. Datasets 

In total, we examined 31 networks: nine academic collaboration 
networks, three e-mail networks, and 19 networks 
extracted from social-media sites. The sites included 
general-purpose social media (similar to Facebook or MySpace) 
as well as special-purpose sites (i.e. focused on sharing 
of blogs, photos, or video). 

All datasets used in this paper were obtained from one of 
four sources: the ASU Social Computing Data Repository [23], 
the Stanford Network Analysis Project [14], the University 
of Michigan [17], and Universitat Rovira i Virgili [1]. All 
networks considered were symmetric - i.e. if a directed edge 
from vertex $v$ to $v'$ exists, there is also an edge from vertex $v'$ 
to $v$. Table I (Categories A-C) shows some of the pertinent qualities of 
these networks. The networks are categorized by the results 
(explained later in this section). In what follows, we provide 
their real-world context. 

B. Category A 

• BlogCatalog is a social blog directory that allows users 
to share blogs with friends [23]. The first two samples of 
this site, BlogCatalog1 and BlogCatalog2, were taken in Jul. 2009 and 
June 2010 respectively. The third sample, BlogCatalog3, 
was uploaded to ASU's Social Computing Data Repository 
in Aug. 2010. 

• Buzznet is a social media network designed for sharing 
photographs, journals, and videos [23]. It was extracted 
in Nov. 2010. 

• Douban is a Chinese social media website designed to 
provide user reviews and recommendations [23]. It was 
extracted in Dec. 2010. 

• Flickr is a social media website that allows users to 
share photographs [23]. It was uploaded to ASU's Social 
Computing Data Repository in Aug. 2010. 

• Flixster is a social media website that allows users to 
share reviews and other information about cinema [23]. 
It was extracted in Dec. 2010. 

• Foursquare is a location-based social media site [23]. It 
was extracted in Dec. 2010. 

• Friendster is a general-purpose social-networking 
site [23]. It was extracted in Nov. 2010. 

• Last.Fm is a music-centered social media site [23]. It 
was extracted in Dec. 2010. 

• LiveJournal is a site designed to allow users to share 
their blogs [23]. It was extracted in Jul. 2010. 

• Livemocha is touted as the "world's largest language 
community" [23]. It was extracted in Dec. 2010. 

• WikiTalk is a network of individuals who sent and 
received messages while editing Wikipedia pages [14]. It 
was extracted in Jan. 2008. 

C. Category B 

• Delicious is a social bookmarking site, designed to allow 
users to share web bookmarks with their friends [23]. It 
was extracted in Dec. 2010. 

• Digg is a social news website that allows users to share 
stories with friends [23]. It was extracted in Dec. 2010. 

• EU E-Mail is an e-mail network extracted from a large 
European Union research institution [14]. It is based on 
e-mail traffic from Oct. 2003 to May 2005. 

• Hyves is a popular general-purpose Dutch social 
networking site [23]. It was extracted in Dec. 2010. 

• Yelp is a social networking site that allows users to share 
product reviews [23]. It was extracted in Nov. 2010. 

D. Category C 

• CA-AstroPh is an academic collaboration network for 
Astrophysics from Jan. 1993 - Apr. 2003 [14]. 

• CA-CondMat is an academic collaboration network for 
Condensed Matter Physics. Samples from 1999 (CondMat99), 
2003 (CondMat03), and 2005 (CondMat05) were 
obtained from the University of Michigan [17]. A second 
sample from 2003 (CondMat03a) was obtained from 
Stanford University [14]. 

• CA-GrQc is an academic collaboration network for 
General Relativity and Quantum Cosmology from Jan. 
1993 - Apr. 2003 [14]. 

• CA-HepPh is an academic collaboration network for 
High Energy Physics - Phenomenology from Jan. 1993 - 
Apr. 2003 [14]. 

• CA-HepTh is an academic collaboration network for 
High Energy Physics - Theory from Jan. 1993 - Apr. 
2003 [14]. 

• CA-NetSci is an academic collaboration network for 
Network Science from May 2006. 

• Enron E-Mail is an e-mail network from the Enron 
corporation made public by the Federal Energy Regulatory 
Commission during its investigation [14]. 

• URV E-Mail is an e-mail network based on communications 
of members of the University Rovira i Virgili 
(Tarragona) [1]. It was extracted in 2003. 

• YouTube is a video-sharing website that allows users 
to establish friendship links [23]. The first sample 
(YouTube1) was extracted in Dec. 2008. The second 
sample (YouTube2) was uploaded to ASU's Social Computing 
Data Repository in Aug. 2010. 

E. Runtime 

First, we examined the runtime of the algorithm (see Figure 2). 
Our experiments aligned well with our time complexity 
result (Proposition 1). For example, a network extracted from 
the Dutch social-media site Hyves consisting of 1.4 million 
nodes and 5.5 million directed edges was processed by our 
algorithm in at most 12.2 minutes. The often-cited LiveJournal 
dataset consisting of 2.2 million nodes and 25.6 million 
directed edges was processed in no more than 66 minutes 
- a short time given that the underlying combinatorial problem 
is NP-hard and the input is large. 
Name             # Nodes      # Edges       Avg. Degree   Source   Type

CATEGORY A
BlogCatalog1     88,784       4,186,390     23.58         ASU      SocMedia
BlogCatalog2     97,884       3,337,294     17.05         ASU      SocMedia
BlogCatalog3     10,312       667,966       32.39         ASU      SocMedia
Buzznet          101,163      5,526,132     27.31         ASU      SocMedia
Douban           154,908      654,324       2.11          ASU      SocMedia
Flickr           80,513       11,799,764    73.28         ASU      SocMedia
Flixster         2,523,386    15,837,602    3.14          ASU      SocMedia
Foursquare       639,014      6,429,972     5.03          ASU      SocMedia
Friendster       5,689,498    28,135,774    2.47          ASU      SocMedia
Last.Fm          1,191,812    9,038,680     3.79          ASU      SocMedia
LiveJournal      2,238,731    25,632,368    5.72          ASU      SocMedia
Livemocha        104,103      4,386,166     21.07         ASU      SocMedia
WikiTalk         2,394,385    9,319,130     1.95          SNAP     SocMedia

CATEGORY B
Delicious        536,408      2,732,272     2.55          ASU      SocMedia
Digg             771,231      11,814,826    7.66          ASU      SocMedia
EU E-Mail        265,214      728,962       1.37          SNAP     E-Mail
Hyves            1,402,673    5,554,838     1.98          ASU      SocMedia
Yelp             487,401      4,686,962     4.81          ASU      SocMedia

CATEGORY C
CA-AstroPh       18,772       396,100       10.55         SNAP     Collab
CA-CondMat03     30,460       240,058       3.94          UMich    Collab
CA-CondMat03a    23,133       186,878       4.04          SNAP     Collab
CA-CondMat05     39,577       351,384       4.44          UMich    Collab
CA-CondMat99     16,264       95,188        2.93          UMich    Collab
CA-GrQc          5,242        28,968        2.76          SNAP     Collab
CA-HepPh         12,008       236,978       9.87          SNAP     Collab
CA-HepTh         9,877        51,946        2.63          SNAP     Collab
CA-NetSci        1,463        5,486         1.87          UMich    Collab
Enron E-Mail     36,692       367,662       5.01          SNAP     E-Mail
URV E-Mail       1,133        10,902        4.81          URV      E-Mail
YouTube1         13,723       153,530       5.59          ASU      SocMedia
YouTube2         1,138,499    5,980,886     2.63          ASU      SocMedia

TABLE I 

Information on the networks in Categories A, B, and C. 



F. Seed Size 

For each network, we performed 10 "integer" trials. In 
these trials, we set $\theta(v_i) = \min(d_i^{in}, k)$ (that is, each node 
requires $k$ active incoming neighbors, capped at its in-degree), where $k$ was kept 
constant among all vertices for each trial and set at an integer 
in the interval [1, 10]. We evaluated the ability of a network 
to promote spreading under the tipping model based on the 
size of the set of nodes returned by our algorithm (as a 
percentage of total nodes). For purposes of discussion, we have 
grouped our networks into three categories based on results 
(Figure 3 and Table II). In general, online social networks had 
the smallest seed sets - 13 networks of this type had an average 
seed set size less than 2% of the population. We also noticed 
that, for most networks, there was a linear relation between 
threshold value and seed size. 
Fig. 2. $m \ln n$ vs. runtime in seconds (log scale, $m$ is number of edges, 
$n$ is number of nodes). The relationship is linear with $R^2 = 0.9015$, $p = 
2.2 \cdot 10^{-16}$. 

Category A can be thought of as social networks highly 
susceptible to influence - as a very small fraction of individuals 
initially having a behavior can lead to adoption by the entire 
population. In our ten trials, the average seed size was under 
2% for each of these 13 networks. All were extracted from 
social media websites. For some of the lower threshold levels, 
the size of the set of seed nodes was particularly small. For a 
threshold of three, 11 of the Category A networks had 
a seed size less than 0.5% of the population. For a threshold 
of four, nine networks met that criterion. 

Networks in Category B are susceptible to influence with a 
relatively small set of initial nodes - but not to the extent 
of those in Category A. They had an average initial seed 
size greater than 2% but less than 10%. Members in this 
group included two general purpose social media networks, 
two specialty social media networks, and an e-mail network. 

Category C consisted of networks that seemed to hamper 
diffusion in the tipping model, having an average initial seed 
size greater than 10%. This category included all of the 
academic collaboration networks, two of the email networks, 
and two networks derived from friendship links on YouTube. 

G. Seed Size as a Function of Community Structure 

In this section, we view the results of our heuristic algorithm 
as a measurement of how well a given network promotes 
spreading. Here, we use this measurement to gain insight into 
which structural aspects make a network more likely to be 
"tipped." We compared our results with two network-wide 
measures characterizing community structure. First, clustering 
coefficient (C) is defined for a node as the fraction of neighbor 
pairs that share an edge - making a triangle. For the undirected 
case, we define this concept formally below. 
Fig. 3. Threshold value (assigned as an integer in the interval [1, 10]) vs. size 
of initial seed set as returned by our algorithm in our three identified categories 
of networks (categories A-C are depicted in panels A-C respectively). Average 
seed sizes were under 2% for Category A, 2-10% for Category B and over 
10% for Category C. The relationship, in general, was linear for categories A 
and B and logarithmic for C. CA-NetSci had the largest Louvain modularity 
and clustering coefficient of all the networks. This likely explains why that 
particular network seems to inhibit spreading. 



Definition 6 (Clustering Coefficient): Let $r$ be the number 
of edges between nodes with which $v_i$ has an edge and $d_i$ be 
the degree of $v_i$. The clustering coefficient, $C_i = \frac{2r}{d_i(d_i - 1)}$. 

Intuitively, a node with high $C_i$ tends to have more pairs 
of friends that are also mutual friends. We use the average 
clustering coefficient as a network-wide measure of this local 
property. 
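
As a worked illustration of Definition 6, the following sketch computes $C_i$ and its network-wide average for an undirected network stored as a dictionary of neighbor sets. The function names and the example network are hypothetical, and assigning a coefficient of 0 to nodes with fewer than two neighbors is an assumption made here, since the definition is undefined in that case.

```python
from itertools import combinations


def clustering(adj, v):
    """C_i = 2r / (d_i (d_i - 1)), with r the number of edges among v's neighbors."""
    d = len(adj[v])
    if d < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    r = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2.0 * r / (d * (d - 1))


def average_clustering(adj):
    """Network-wide average of the per-node clustering coefficients."""
    return sum(clustering(adj, v) for v in adj) / len(adj)


# Hypothetical undirected network: triangle a-b-c with a pendant node d on c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
c_of_c = clustering(adj, "c")
c_avg = average_clustering(adj)
```

Here nodes a and b each have coefficient 1, node c has 1/3 (one linked pair out of three neighbor pairs), and node d has 0 under the assumed convention, giving an average of 7/12.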

Second, we consider modularity ($M$) defined by Newman 
and Girvan [16]. For a partition of a network, $M$ is a real 
number in $[-1, 1]$ that measures the density of edges within 
partitions compared to the density of edges between partitions. 
We present a formal definition for an undirected network 
below. 



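Definition 7 (Modularity [16]): Modularity, 
$M = \frac{1}{2m} \sum_{i,j \in V} \left[ A_{ij} - \frac{d_i d_j}{2m} \right] \delta(c_i, c_j)$, where 
$A_{ij}$ is 1 if $v_i$ and $v_j$ share an edge and 0 otherwise, $m$ is the 
number of undirected edges, $d_i$ is node degree, $c_i$ is the 
community to which $v_i$ belongs and $\delta(x, y) = 1$ if $x = y$ and 
0 otherwise. 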

The modularity of an optimal network partition can be used 
to measure the quality of its community structure. Though 
modularity-maximization is NP-hard, the approximation 
algorithm of Blondel et al. [4] (a.k.a. the "Louvain algorithm") 
has been shown to produce near-optimal partitions.¹ We call 
the modularity associated with this algorithm the "Louvain 
modularity." Unlike $C$, which describes local properties, 
$M$ is descriptive of the community level. For the 31 networks 
we considered, $M$ and $C$ appear uncorrelated ($R^2 = 0.0538$, 
$p = 0.2092$). 
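
To make the modularity measure concrete, the sketch below evaluates $M$ for a given partition of an undirected network, using the standard Newman-Girvan form in which an adjacency indicator $A_{ij}$ is compared against the expected value $d_i d_j / 2m$. It does not perform the Louvain optimization itself; the function name, graph representation, and the two-triangle example are assumptions introduced for illustration.

```python
def modularity(adj, community):
    """M = (1/2m) * sum over same-community pairs of (A_ij - d_i d_j / 2m)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2.0  # number of undirected edges
    total = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue  # delta(c_i, c_j) = 0
            a_ij = 1.0 if j in adj[i] else 0.0
            total += a_ij - len(adj[i]) * len(adj[j]) / (2.0 * m)
    return total / (2.0 * m)


# Hypothetical network: two triangles (1-2-3 and 4-5-6) joined by the edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
split = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
m_split = modularity(adj, split)
m_single = modularity(adj, {v: "all" for v in adj})
```

Splitting the network into its two triangles gives $M = 5/14$, while placing every node in a single community gives $M = 0$, which matches the intuition that $M$ rewards partitions with dense internal and sparse external connections. The double loop is quadratic in the number of nodes and is only meant for small examples.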

We plotted the initial seed set size ($S$) (from our algorithm 
- averaged over the 10 threshold settings) as a function of 
$M$ and $C$ (Figure 4) and uncovered a correlation (planar fit, 
$R^2 = 0.8666$, $p = 5.666 \cdot 10^{-13}$, see Figure 4A). The majority 
of networks in Category C (less susceptible to spreading) 
were characterized by relatively large $M$ and $C$ (Category 
C includes the top nine networks w.r.t. $C$ and top five w.r.t. 
$M$). Hence, networks with dense, segregated, and close-knit 
communities (large $M$ and $C$) suppress spreading. Likewise, 
those with low $M$ and $C$ tended to promote spreading. Also, 
we note that there were networks that promoted spreading with 
dense and segregated communities, yet were less clustered (i.e. 
Category A networks Friendster and LiveJournal both have 
$M > 0.65$ and $C < 0.13$). Further, some networks with a 
moderately large clustering coefficient were also in Category 
A (two networks extracted from BlogCatalog had $C > 0.46$) 
but had a relatively less dense community structure (for those 
two networks $M < 0.33$). 
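
As a worked example of how the planar fit relates $M$ and $C$ to seed size, the sketch below evaluates the fitted plane reported in the caption of Figure 4A, $S = 43.374 \cdot M + 33.794 \cdot C - 24.940$, for three networks using the rounded values listed in Table II. The function and variable names are hypothetical, and because the inputs are rounded the residuals are only indicative.

```python
def planar_fit_seed_size(modularity, clustering):
    """Average seed size (%) predicted by the integer-threshold planar fit (Fig. 4A)."""
    return 43.374 * modularity + 33.794 * clustering - 24.940


# (Louvain modularity M, avg. clustering C, observed avg. seed size %) from Table II.
rows = {
    "Douban": (0.60, 0.02, 1.54),
    "CA-GrQc": (0.86, 0.53, 35.09),
    "CA-NetSci": (0.96, 0.69, 50.69),
}

# Observed minus predicted seed size for each network.
residuals = {
    name: observed - planar_fit_seed_size(m, c)
    for name, (m, c, observed) in rows.items()
}
```

For Douban the plane predicts roughly 1.8% against an observed 1.54%, while for CA-NetSci it predicts roughly 40% against an observed 50.69%, so the fit captures the ordering of these networks but not the exact values.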

We also studied the effects on spreading when the threshold 
values are assigned as a certain fraction of the node's 
in-degree [11], [22]. This results in heterogeneous $k_i$'s for the 
nodes. We performed 12 trials for each network. Thresholds 
for each trial were based on the product of in-degree and a 
fraction in the interval [0.05, 0.60] (multiples of 0.05). The 
results (Figure 5 and Table II) were analogous to our integer 
tests. We also compared the averages over these trials with 
$M$ and $C$ and obtained similar results as with the other trials 
(Figure 4B). 

V. Related Work 

Tipping models first became popular through the works of 
[10] and [20], where they were presented primarily in a social 
context. Since then, several variants have been introduced in 
the literature including the non-deterministic version of [12] 
(described later in this section) and a generalized version of 
¹ Louvain modularity was computed using the implementation available 
from CRANS at http://perso.crans.org/aynaud/communities/. 
Fig. 4. (A) Louvain modularity ($M$) and average clustering coefficient ($C$) 
vs. the average seed size ($S$). The planar fit depicted is $S = 43.374 \cdot M + 
33.794 \cdot C - 24.940$ with $R^2 = 0.8666$, $p = 5.666 \cdot 10^{-13}$. (B) Same 
plot as (A) except the averages are over the 12 percentage-based threshold 
values. The planar fit depicted is $S = 18.105 \cdot M + 17.257 \cdot C - 10.388$ 
with $R^2 = 0.816$, $p = 5.117 \cdot 10^{-11}$. 



Fig. 5. Threshold value (assigned as a fraction of node in-degree as a multiple 
of 0.05 in the interval [0.05, 0.60]) vs. size of initial seed set as returned by 
our algorithm in our three identified categories of networks (categories A-C 
are depicted in panels A-C respectively, categories are the same as in Figure 
3). Average seed sizes were under 5% for Category A, 1-7% for Category 
B and over 3% for Category C. In general, the relationship between threshold 
and initial seed size for networks in all categories was exponential. 



[11]. In this paper we focused on the deterministic version. In 
[22], the authors look at deterministic tipping where each node 
is activated upon a percentage of neighbors being activated. 
Dreyer and Roberts [9] introduce the MIN-SEED problem, 
study its complexity, and describe several of its properties 
w.r.t. certain special cases of graphs/networks. The hardness 
of approximation for this problem is described in [7]. The 
work of [3] presents an algorithm for target-set selection 
whose complexity is determined by the tree-width of the 
graph - though it provides no experiments or evidence that 
the algorithm can scale for large datasets. The recent work of 
[18] proves a non-trivial upper bound on the smallest seed set. 



Our algorithm is based on the idea of shell decomposition 
that currently is prevalent in physics literature. In this process, 
which was introduced in [21], vertices (and their adjacent 
edges) are iteratively pruned from the network until a network 
"core" is produced. In the most common case, for some value 
$k$, nodes whose degree is less than $k$ are pruned (in order of 
degree) until no more nodes can be removed. This process 
was used to model the Internet in [5] and find key spreaders 
under the SIR epidemic model in [13]. More recently, a 
"heterogeneous" version of decomposition was introduced in 
[2] - in which each node is pruned according to a certain 



                 Clust.   Louv.   Int.-based   R^2 (linear    p-value (linear   Deg.-based   R^2 (exp.     p-value (exp.
Name             Coeff.   Mod.    Avg. Seed    fit for Int.   fit for Int.      Avg. Seed    fit for Deg.  fit for Deg.
                                  Size (%)     tests)         tests)            Size (%)     tests)        tests)

CATEGORY A
BlogCatalog1     0.35     0.32    0.73         0.97           1.4E-07           1.01         0.90          2.15E-06
BlogCatalog2     0.49     0.33    0.01         0.86           1.1E-04           0.69         0.90          2.25E-05
BlogCatalog3     0.46     0.24    0.29         0.89           3.9E-05           3.62         0.96          1.42E-08
Buzznet          0.23     0.31    0.40         0.83           2.7E-04           1.78         0.93          4.99E-07
Douban           0.02     0.60    1.54         0.99           3.2E-09           1.73         0.84          2.76E-05
Flickr           0.17     0.52    0.69         0.95           1.2E-06           3.11         0.89          3.89E-06
Flixster         0.08     0.60    1.14         1.00           1.1E-11           0.98         0.89          5.0SE-06
Foursquare       0.11     0.40    0.07         0.27           1.2E-01           0.44         0.51          9.50E-03
Friendster       0.05     0.76    0.21         0.95           1.2E-06           0.42         0.86          1.38E-05
Last.Fm          0.07     0.58    1.31         0.97           1.2E-07           0.93         0.79          1.19E-04
LiveJournal      0.13     0.55    1.12         0.97           1.4E-07           1.09         0.79          1.22E-04
Livemocha        0.05     0.35    1.04         0.89           3.6E-05           2.31         0.90          2.99E-05
WikiTalk         0.05     0.58    0.90         0.98           8.0E-08           0.37         0.82          5.56E-05

CATEGORY B
Delicious        0.03     0.75    8.27         0.98           2.9E-08           2.87         0.86          1.5E-05
Digg             0.09     0.53    4.64         0.98           2.0E-08           1.10         0.73          3.8E-04
EU E-Mail        0.07     0.79    6.65         0.81           3.8E-04           6.48         0.95          5.8E-08
Hyves            0.04     0.77    4.90         0.97           1.5E-07           2.10         0.79          1.2E-04
Yelp             0.11     0.52    7.07         0.99           2.2E-10           1.44         0.70          7.2E-04

CATEGORY C
CA-AstroPh       0.63     0.53    14.31        1.00           5.3E-11           8.53         0.89          3.4E-06
CA-CondMat03     0.65     0.76    27.80        0.98           7.8E-08           12.45        0.92          8.7E-07
CA-CondMat03a    0.63     0.73    26.52        0.98           2.3E-08           11.62        0.91          1.2E-06
CA-CondMat05     0.65     0.73    25.59        0.98           2.8E-08           11.26        0.91          1.6E-06
CA-CondMat99     0.64     0.85    34.71        0.95           1.3E-06           15.48        0.93          3.0E-07
CA-GrQc          0.53     0.86    35.09        0.92           1.2E-05           16.86        0.92          8.1E-07
CA-HepPh         0.61     0.56    21.35        0.98           1.8E-08           10.59        0.91          1.2E-06
CA-HepTh         0.47     0.77    30.63        0.95           1.3E-06           12.47        0.89          4.1E-06
CA-NetSci        0.69     0.96    50.69        0.82           3.0E-04           29.22        0.93          5.5E-07
Enron E-Mail     0.50     0.52    18.15        0.95           1.3E-06           7.64         0.90          2.5E-06
URV E-Mail       0.22     0.57    13.17        0.97           1.5E-07           5.54         0.87          9.8E-06
YouTube1         0.14     0.57    11.21        0.98           4.8E-08           4.24         0.86          1.3E-05
YouTube2         0.08     0.72    16.06        0.87           7.9E-05           3.73         0.79          1.2E-04

TABLE II 

Regression analysis and network-wide measures for the 
networks in Categories A, B, and C. 



parameter - and the process is studied in that work based on 
a probability distribution of nodes with certain values for this 
parameter. 

A. Notes on Non-Deterministic Tipping 

We also note that an alternate version of the model where 
the thresholds are assigned randomly has inspired approximation 
schemes for the corresponding version of the seed 
set problem [8], [12], [15]. Work in this area focused on 
finding a seed set of a certain size that maximizes the 
expected number of adopters. The main finding by Kempe 
et al., the classic work for this model, was to prove that the 
expected number of adopters was submodular - which allowed 
for a greedy approximation scheme. In this algorithm, at each 
iteration, the node which allows for the greatest increase in the 
expected number of adopters is selected. The approximation 
guarantee obtained (less than 0.63 of optimal) is contingent 
upon an approximation guarantee for determining the expected 
number of adopters - which was later proved to be #P-hard [8]. 
Though finding such a guarantee is still an open 
question, work on counting-complexity problems such as that 
of Dan Roth [19] indicates that a non-trivial approximation ratio 
is unlikely. Further, the simulation operation is often expensive 
- causing the overall time complexity to be O(x · n^…) where x 
is the number of runs per simulation and n is the number of 
nodes (typically, x > n). In order to avoid simulation, various 
heuristics have been proposed, but these typically rely on the 
computation of geodesics - an O(n^…) operation - which is also 
more expensive than our approach. 

Additionally, the approximation argument for the non- 
deterministic case does not directly apply to the original (de- 
terministic) model presented in this paper. A simple counter- 
example shows that sub-modularity does not hold here. Sub- 
modularity (diminishing returns) is the property leveraged by 
Kempe et al. in their approximation result. 
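
One such counter-example can be checked mechanically. The sketch below is an illustration constructed here (the three-node network, the thresholds, and the function name are assumptions, not taken from the cited works): a center node c with threshold 2 is connected to two leaves a and b, each with threshold 1. Submodularity would require the marginal gain of adding a to shrink, or at least not grow, as the set it is added to gets larger.

```python
def spread_size(in_nbrs, k, seed):
    """Number of nodes active once the deterministic tipping process converges."""
    active, changed = set(seed), True
    while changed:
        changed = False
        for v in in_nbrs:
            if v not in active and len(in_nbrs[v] & active) >= k[v]:
                active.add(v)
                changed = True
    return len(active)


# Hypothetical symmetric star: center c needs both leaves, each leaf needs c.
in_nbrs = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
k = {"a": 1, "b": 1, "c": 2}

# Marginal gain of adding a to the empty set versus adding a to {b}.
gain_alone = spread_size(in_nbrs, k, {"a"}) - spread_size(in_nbrs, k, set())
gain_with_b = spread_size(in_nbrs, k, {"a", "b"}) - spread_size(in_nbrs, k, {"b"})
```

Adding a to the empty set activates one node, but adding a to {b} raises the count from one to three, a gain of two. The marginal gain increases with the size of the base set, so the number of adopters is not submodular under deterministic thresholds.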

B. Note on an Upper Bound of the Initial Seed Set 

Very recently, we were made aware of research by Daniel 
Reichman that proves an upper bound on the minimal size of 
a seed set for the special case of undirected networks with 
homogeneous threshold values [18]. The proof is constructive 
and yields an algorithm that mirrors our approach (although 
Reichman's algorithm applies only to that special case). We 
note that our work and the work of Reichman were developed 
independently. We also note that Reichman performs no 
experimental evaluation of the algorithm. 

Given undirected network $G$ where each node $v_i$ has degree 
$d_i$ and the threshold value for all nodes is $k$, Reichman proves 
that the size of the minimal seed set can be bounded by 
$\sum_{v_i \in V} \min\{1, \frac{k}{d_i + 1}\}$. For our integer tests, we compared our 
results to Reichman's bound. Our seed sets were considerably 
smaller - often by an order of magnitude or more. See Figure 6 
for details. 
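
For reference, the bound can be computed directly from a degree sequence. The sketch below assumes the bound has the form $\sum_{v_i \in V} \min\{1, k/(d_i + 1)\}$; the function name and the example degree sequence are hypothetical.

```python
def reichman_bound(degrees, k):
    """Upper bound on the minimal seed set size: sum of min(1, k / (d_i + 1))."""
    return sum(min(1.0, k / (d + 1)) for d in degrees)


# Hypothetical degree sequence: triangle a-b-c with a pendant node d on c.
degrees = [2, 2, 3, 1]
bound_k2 = reichman_bound(degrees, 2)
```

For this four-node example with $k = 2$, the terms are 2/3, 2/3, 1/2, and 1, for a bound of 17/6, or roughly 2.83 nodes. Low-degree nodes contribute a full unit each, which is one reason the bound can be loose on networks with many such nodes.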

VI. Conclusion 

As recent empirical work on tipping indicates that it can 
occur in real social networks [6], [24], our results are 
encouraging for viral marketers. Even if we assume relatively 
large threshold values, small initial seed sizes can often be 
found using our fast algorithm - even for large datasets. For 
example, with the Foursquare online social network, under 
majority threshold (50% of incoming neighbors previously 
adopted), a viral marketer could expect a 297-fold return on 
investment. As results of this type seem to hold for many 
online social networks, our algorithm seems to hold promise 
for those wishing to "go viral." 